home *** CD-ROM | disk | FTP | other *** search
- Path: pm074-11.dialip.mich.net!user
- From: nkuitse@umich.edu (Paul M. Hoffman)
- Newsgroups: comp.lang.c,comp.sys.mac.programmer.tools
- Subject: Re: HELP ME PLEASE: Thesaurus function / Thesuarus lists needed
- Date: Sun, 21 Jan 1996 10:57:59 -0500
- Organization: University of Michigan
- Message-ID: <nkuitse-2101961057590001@pm074-11.dialip.mich.net>
- References: <4dj5ke$uo2@news.clandjop.com>
- NNTP-Posting-Host: pm074-11.dialip.mich.net
-
- In article <4dj5ke$uo2@news.clandjop.com>, eday@clandjop.com wrote:
-
- >I need to add a thesaurus function to a program. I need some
- >help/guidance in starting this project.
-
- [snip]
-
- >So, here I am, asking: Where can I go, who do I ask, what can I do to
- >get this project going?
-
- I'm doing something similar, using data based on a Project Gutenberg
- edition of Roget's Thesaurus -- this is the last public-domain edition,
- from 1911, plus a few modern-day changes. Here's a description from the
- file:
-
- >*********************************************************************
- >** Thesaurus-1911 **
- >** Version 1.02 (supplemented: July 18, 1991) **
- >*********************************************************************
- ><-- An electronic thesaurus derived from the version of Roget's
- >Thesaurus published in 1911.
- > This thesaurus has been prepared by MICRA, Inc. (May 1991).
-
- [snip]
-
- > Note that this version of Thesaurus-1911 has been supplemented
- >with over 1,000 words not present in the original 1911 edition, but
- >many modern words are still missing. About 1500 verbs (out of 6500)
- >which can be found in an 80,000-word spell-checker are absent from
- >this work.
-
- Here's one place to get it (URL from an Archie search in Anarchie):
-
- ftp://ftp.uu.net//doc/literary/gutenberg/etext91/roget14a.txt.Z
-
- It's 600K or so, 1400K when uncompressed (use MacCompress). It's split up
- by meaning into about 1000 entries, like its modern-day descendant,
- Roget's International Thesaurus. Here's an example of (part of) an entry:
-
- > #45. [Connecting medium.] Connection. -- N. vinculum, link;
- >connective, connection; junction &c. 43; bond of union, copula,
- >hyphen, intermedium[obs3]; bracket; bridge, stepping-stone, isthmus.
- > bond, tendon, tendril; fiber; cord, cordage; riband, ribbon,
- >rope, guy, cable, line, halser|, hawser, painter, moorings, wire,
- >chain; string &c. (filament) 205.
-
- I have a cleaned-up version which I can make available if you're
- interested; it looks like this:
-
- >#45. [Connecting medium] Connection
- ><N>
- > @vinculum, link; connective, connection; junction <etc> 43;
- >bond of union, copula, hyphen, intermedium[<obs>]; bracket; bridge,
- >stepping-stone, isthmus.
- > @bond, tendon, tendril; fiber; cord, cordage; riband, ribbon,
- >rope, guy, cable, line, halser[<pipe>], hawser, painter, moorings,
- >wire, chain; string <etc> (filament) 205.
-
- Once you have the data, you need to do some serious Perl (or some such)
- coding. One (relatively) quick and simple way to make it usable is as
- follows.
-
- Generate an intermediate index by word, looking something like this:
-
- ...
- bond 43 45 79 402 406 768
- bondage 43 402 612
- ...
- Presbyterian 509 993
- ...
-
- (Use Perl associative arrays; it shouldn't be too hard, if perhaps a bit
- memory-intensive -- I give MacPerl 8M+ for scripts like this. I can
- e-mail you sample Perl scripts if you like.)
-
- Generate a hash table from this. You might use a "dumb" hash table that
- doesn't resolve collisions: say that "bondage" and "Presbyterian" hash to
- the same slot, which points to a sequence of entry numbers that is the
- union of those for "bondage" and "Presbyterian":
-
- 43 45 79 402 406 509 768 993
-
- Read all those entries and discard the ones that don't contain the word
- "bondage".
-
- You'll need an index that takes you from the entry number to its byte
- offset within the file. Again, use Perl to create it:
-
- 1 : 0
- 2 : 409
- 3 : 745
- 4 : 1074
-
- Munge this into a Rez resource rescription (excuse my questionable syntax here):
-
- resource 'EDAT' 2000 {
- 0, 409, 745, 1074, ...
- };
-
- or to an array declaration in C, or whatever.
-
- Lastly, put all of this stuff into "chunks" in the data fork of a file:
-
- 1 <header, including offsets of chunks 2-5>
- 2 <hash table proper: each entry is an offset into chunk 3>
- 3 <entry number sequences referenced by the hash table>
- 4 <table mapping entry numbers to entry offsets within chunk 5>
- 5 <the thesaurus data itself>
-
- Keep chunks 1, 2, 4 and (if you can) 3 in memory. You might want a way to
- keep often-used parts of chunk 5 in memory, too, but that complicates
- things a little. (Hint: use a simple cache, an array of pointers to, say,
- the 16 most recently used entries. For people who like to "branch"
- through thesauri, this should make things much quicker.)
-
- Et voila` (waving hands) you have a usable thesaurus! Well, more or
- less. The only really hard part should be the hash table: I've made
- collision resolution sound easy (by ignoring it), but you will run into
- problems if you're not careful. Try using a variety of hash functions and
- pick the one that yields the best results.
-
- Here's a pseudo-C skeleton of the look-up algorithm:
-
- Boolean LookUp (String word, StringArray results)
- {
- /* Return each entry that contains the specified word */
- UnsignedInt h, entryNum;
- Ptr p;
- String entry;
- Boolean success = false;
-
- h = HashWord (word) mod kHashTableSize;
- for (each entryNum in slot h of the hash table) {
- ReadEntryIntoString (entryNum, entry);
- if (entry contains word) {
- AddStringToArray (results, entry);
- success = true;
- }
- }
- return success;
- }
-
- This isn't the most elegant way to do it, but it's simple. (Tempting
- alternative: use a separate resource for each entry. Don't do it: it may
- sound good, but it looks like gross Resource Manager abuse to me. A
- thousand resources is an awful lot.)
-
- I love the smell of Perl-generated C code in the morning!
-
- Good luck,
-
- Paul.
-
- --
- Paul Hoffman | mailto:paul.hoffman@umich.edu
- Taubman Medical Lib. | http://www-personal.umich.edu/~nkuitse/
- Univ. of Michigan | "Dragons do not enter into this message"
-